I wish to emphasize at the outset that I am considering only the
domain of achievement tests, rather than other domains such as those of 
personality tests and of intelligence tests. Further, I am considering only 
achievement tests that are specifically linked to an instructional program and 
have been developed in relation to an objectives base and/or to an item
generation rule. These may or may not be criterion-referenced tests, depending
upon the definition employed for that term; however, they quite likely are
tests for which there is an interpretation of a particular student's
obtained score that does not depend on knowledge of the scores of any other
student. I think of mastery tests as falling into this category, and I
have discussed some technical characteristics of such tests in one of the
Center's monographs (Harris, 1974).

Today I wish to report on an inquiry which is now underway but not 
completed. The inquiry is an attempt to examine the grounds and methods 
for studying student response data to the type of test I am considering. 
Such study of student response data is intended to throw light on the complex 
of instructional programs plus test development and interpretation. This 
differs from the typical practice of finding numbers to be used to choose 
items from an undefined or accidental pool of items on the grounds that such
numbers mean that these items will work well in a particular sample of
students whose instructional history is not known or possibly not considered
relevant.

Let us assume that, for the type of test that I am considering, a 
pool of items has been carefully conceptualized and constructed to represent
the behaviors that the instructional program is designed to foster and
that rules have been developed for sampling this pool of items in such a way
as to yield aggregates of items for which one or more instructionally relevant
scores can be developed. It seems reasonable to require that such sampling
have a random character, but it may of course operate within strata or cells.
Such a base defines a universe of items and a universe of test scores based
upon appropriate samples of these items. Let us further assume that we would
like to use student response data to study both the instructional program and 
the test development and test interpretation process. We have identified 
several types of studies that seem to be fruitful; not surprisingly, some of 
these studies are rather standard ones. I shall outline three types of
studies and for each one illustrate the kind of procedure we believe will
be appropriate for the purposes described above.
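
The sampling rule described above, random selection operating within strata or cells, can be sketched as follows. This is only an illustration under stated assumptions: the pool layout (a mapping from a stratum label to its items), the function name `sample_items_by_stratum`, and the example objectives are hypothetical and are not taken from the inquiry itself.

```python
import random


def sample_items_by_stratum(pool, per_stratum, seed=None):
    """Draw a simple random sample of items within each stratum (cell).

    pool        : dict mapping a stratum label to a list of item identifiers
                  (hypothetical layout, assumed for this sketch)
    per_stratum : number of items to draw from every stratum
    seed        : optional seed so that a drawn form can be reproduced
    """
    rng = random.Random(seed)
    form = {}
    for stratum, items in pool.items():
        if per_stratum > len(items):
            raise ValueError(f"stratum {stratum!r} has only {len(items)} items")
        # Sampling without replacement inside the cell.
        form[stratum] = rng.sample(items, per_stratum)
    return form


# Hypothetical pool: three objectives with ten items each.
pool = {
    f"objective_{k}": [f"obj{k}_item{i}" for i in range(1, 11)]
    for k in (1, 2, 3)
}
form = sample_items_by_stratum(pool, per_stratum=4, seed=0)
```

Each call with a different seed yields another aggregate of items from the same universe, so scores computed on such forms are samples from the universe of test scores that the pool and the sampling rule define.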

The notion of stability, which can be related to the concept of specific 
reliability, is of importance. What we would like to have is an estimate of
an appropriate stability coefficient for an item and a coefficient for a
score from which one could describe generally this characteristic of the
pool of items and of the universe of test scores that can be derived from
these items. Let me illustrate at the level of the item. If, for a population
of students whose instructional history has been controlled, an item varies
markedly in difficulty (normative difficulty for this population) over
administrations separated by brief time periods during which no additional
instruction is given, then we have evidence that the combination of instruc-
tion and test development process yields undependable item data. Such a
finding for a random sample of the items would be grounds for reworking the
instruction and the test development process, with the hope of finding clues
as to why the items behaved so badly. There would be several places to look.
For example, the item type or format may be so unfamiliar to this population of
students as to introduce a factor of learning that systematically makes the
item easier on the second trial.

One can use McNemar's chi square procedure (or the underlying exact
procedure that employs the binomial) to test the hypothesis that the difficulty
of the item is the same for the two administrations. This is very useful, and
it is a proper test for this purpose; the more familiar use of chi square for
a test of independence seems to me to be all wrong here. But the McNemar
test probably isn't sufficient. We would also like an estimate of the common
difficulty level of the item and we would like to be able to aggregate such
estimates to secure a meaningful index to describe the pool of items on the
basis of a study of a sample of these items. We would also like an estimate
of the degree of association (or some aspect of association) that can be
aggregated in a similar manner. We are now exploring a statistic devised by
Lazarsfeld and Kendall and reported by Goodman and Kruskal (1959, pp. 149-150),
using some Monte Carlo methods to examine its sampling distribution. In
time we will know whether or not to recommend it.
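
A minimal sketch of the McNemar procedure for a single item administered twice may make the computation concrete. The cell labels are assumptions of this sketch: `a` = passed both times, `b` = passed first and failed second, `c` = failed first and passed second, `d` = failed both times. The chi square is given without a continuity correction, and the exact version doubles the smaller binomial tail. The pooled proportion in `common_difficulty` is one simple way to estimate a common difficulty level; it is offered as an illustration, not as the estimator under study in the inquiry.

```python
from math import comb


def mcnemar(b, c):
    """McNemar test that item difficulty is equal on two administrations.

    b, c : the two discordant counts (pass-fail and fail-pass).
    Returns (chi_square, exact_two_sided_p).
    """
    n = b + c
    if n == 0:
        # No discordant pairs: the data give no evidence of change.
        return 0.0, 1.0
    chi_square = (b - c) ** 2 / n  # 1 degree of freedom, no correction
    # Exact test: under the hypothesis, b is Binomial(n, 1/2).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return chi_square, min(1.0, 2 * tail)


def common_difficulty(a, b, c, d):
    """Pooled proportion passing, averaged over the two administrations.

    Assumption of this sketch: difficulty is expressed as proportion correct.
    """
    n = a + b + c + d
    # (a + b)/n passed the first time, (a + c)/n passed the second time.
    return (2 * a + b + c) / (2 * n)


# Hypothetical counts for one item and 100 students.
a, b, c, d = 40, 5, 15, 40
chi_square, p_exact = mcnemar(b, c)
pooled = common_difficulty(a, b, c, d)
```

Only the discordant cells enter the test, which is why it addresses a change in difficulty rather than independence of the two sets of responses.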

A second important notion is that of equivalence, which can be related
to the concept of generic reliability. I illustrate again with the case of
the study of a pair of items. In such a study one may or may not expect
the two item difficulties to be the same for a specified population of
students for whom the instructional history has been controlled, and so a
test of the hypothesis of identical difficulties may or may not be informative.
Even if such a test is informative, however, one would like estimates of
difficulty for the two items and some measure of association that might be
meaningfully aggregated. We have found important leads in Goodman and
Kruskal (1959) and in the fairly new volume by Fleiss (1973) and are looking
into sampling characteristics for these measures. For both stability and
equivalence item studies, the appropriate sampling design is Fleiss'
Method I.
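
As an illustration of the kind of summary a pair-of-items study might produce, the sketch below computes the two difficulty estimates and one candidate measure of association from a two-by-two table. The text does not say which measure will be recommended; Yule's Q (which coincides with the Goodman and Kruskal gamma for a two-by-two table) is used here purely as an example, and the cell labels and function name are assumptions of the sketch.

```python
def item_pair_summary(a, b, c, d):
    """Summarize responses to two items by the same students.

    a : passed both items       b : passed item 1 only
    c : passed item 2 only      d : failed both items
    Returns (difficulty_1, difficulty_2, yules_q); yules_q is None when
    it is undefined (a*d + b*c == 0).
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    difficulty_1 = (a + b) / n  # proportion passing item 1
    difficulty_2 = (a + c) / n  # proportion passing item 2
    concordant = a * d
    discordant = b * c
    if concordant + discordant == 0:
        yules_q = None
    else:
        yules_q = (concordant - discordant) / (concordant + discordant)
    return difficulty_1, difficulty_2, yules_q


# Hypothetical counts for one pair of items and 80 students.
p1, p2, q = item_pair_summary(30, 10, 10, 30)
```

Whether an average of such values over a random sample of item pairs is a meaningful index for the pool depends on the sampling behavior of the chosen measure, which is the question the inquiry leaves open.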

A third type of study is that of sensitivity to instruction. If the 
instructional program is effective and if the test development process has
yielded items and test scores that measure the outcomes of the instruction
adequately, then one expects that the items and/or the test scores will be
sensitive to instruction. If they are not, then again something is wrong 
and one must begin a search for the defect, which may be in either or both 
the instruction or the test development. In studying sensitivity to instruc- 
tion of an item, more than one experimental and sampling design is avail- 
able, and the statistic one would employ to measure sensitivity to instruction 
may differ with the different designs. If we choose a sample of students to 
whom we administer the item, whom we then teach, and to whom we then readmin-
ister the item, we have fixed the total sample size but not the marginals in
the two-by-two table, and we have introduced an experimental manipulation 
that is intended to change the difficulty of the item. With such a design 
the usual chi square test of independence and the related phi coefficient
are inappropriate. Instead, one would like a measure of the amount of change 
attributable to instruction, and this can be derived from the appropriate 
conditional probability which can be estimated by determining the proportion 
of those who failed the item on the first administration who passed it on 
the second administration. It also is possible to introduce a model of 
measurement error for the responses and develop a modified estimate of this
conditional probability corrected for measurement error.
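
The uncorrected estimate described above, the proportion of those failing the item on the first administration who pass it on the second, can be sketched as follows. The 0/1 coding of responses and the function name are assumptions of this sketch, and the correction for measurement error is not attempted here because the text does not specify the error model.

```python
def sensitivity_to_instruction(pre, post):
    """Estimate P(pass after instruction | fail before instruction).

    pre, post : equal-length sequences of 0/1 scores on one item for the
                same students, before and after instruction
    Returns None when no student failed the first administration.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must cover the same students")
    # Second-administration scores of students who failed the first time.
    second_scores = [after for before, after in zip(pre, post) if before == 0]
    if not second_scores:
        return None
    return sum(second_scores) / len(second_scores)


# Hypothetical responses of eight students to one item.
pre = [0, 0, 0, 0, 1, 1, 0, 0]
post = [1, 1, 0, 1, 1, 0, 1, 0]
estimate = sensitivity_to_instruction(pre, post)
```

Students who passed the item before instruction are ignored by this estimate, since they supply no information about a change from failing to passing.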

As we study the various methods that have been suggested for examining
stability, equivalence, and sensitivity to instruction of both test items
and test scores, we are attempting to coordinate three things: sampling
procedure and experimental design, choice of a statistic, and method of
aggregating the statistic so as to provide generalizations for the pool of
items or the universe of test scores. We hope to have a number of specific
results that can be summarized in a forthcoming issue of the Center
monograph series.